Chromosome-level genome assembly of chub mackerel (Scomber japonicus) from the Indo-Pacific Ocean

Chub mackerels (Scomber japonicus) are a migratory marine fish widely distributed in the Indo-Pacific Ocean. They are globally consumed for their high Omega-3 content, but their population is declining due to global warming. Here, we generated the first chromosome-level genome assembly of chub mackerel (fScoJap1) using the Vertebrate Genomes Project assembly pipeline with PacBio HiFi genomic sequencing and Arima Hi-C chromosome contact data. The final assembly is 828.68 Mb with 24 chromosomes, nearly all containing telomeric repeats at their ends. We annotated 31,656 genes and discovered that approximately 2.19% of the genome contained DNA transposon elements repressed within duplicated genes. Analyzing 5-methylcytosine (5mC) modifications using HiFi reads, we observed open/close chromatin patterns at gene promoters, including the FADS2 gene involved in Omega-3 production. This chromosome-level reference genome provides unprecedented opportunities for advancing our knowledge of chub mackerels in biology, industry, and conservation.

disjunct geographical areas 9 , making them suitable for comparative genetic studies.Despite their ecological and commercial value, the population size of chub mackerel has recently declined 11 due to climate change affecting optimal habitat conditions and temperature-dependent hatching rates 12 , placing the genetic resources of chub mackerel at stake.
Here, we constructed a chromosome-level genome assembly of a male chub mackerel individual (fScoJap1) collected from the South Sea of South Korea (Fig. 1c).We extracted genomic DNA from five different tissues and performed sequencing using PacBio long high-fidelity (HiFi), Illumina and Arima Hi-C technologies, following the Vertebrate Genomes Project (VGP) assembly standard pipeline v2.0 13,14 (Fig. 2a).The estimated genome size using GenomeScope 15 on Illumina genomic reads was 810 Mb (Fig. 2b), while on HiFi reads was shorter (628 Mb) (Fig. 2c).The underestimation of genome size with HiFi reads is consistent with patterns seen in other recent high-quality genome assemblies [16][17][18][19][20][21][22] (Supplementary table S2), most prominent in teleost fishes (Actinopterygii).The recent study on the HiFi assembly of the closest species to chub mackerel, Atlantic chub mackerel, only made a genome size estimation using Illumina reads 23 .The Hi-C mapping allowed reflection of 3D structural distances within each chromosome (Fig. 2d,e).We assembled genome sequences totalling 828.68 Mb in length, which is comparable to the 814.07 Mb assembly of its closest relative, Atlantic chub mackerel 23 .The assembly yielded 24 distinct chromosomal scaffolds (Fig. 2d, Table 1) mostly supported by telomeres at their 5′ and 3′ ends, except for chromosome 10 (Fig. 3, Table 2).We annotated a total of 31,656 genes, including 30,506 protein-coding genes (Table 3) and observed suppression of DNA transposon elements within duplicated genes (Fig. 3a).By examining the 5-methylcytosine (5mC) profile in gene promoter regions using HiFi read data, we gained insight into the open/close chromatin structures associated with a tRNA cluster (Fig. 4) and Omega-3 production genes (Fig. 5).Overall, the chub mackerel genome assembled in this study represents a valuable genetic resource with implications for various fields, including biology, industry, and conservation.

Methods
Sample collection, library construction, and sequencing.Brain, gill, muscle, liver and gonad tissues of a male chub mackerel caught in juvenile stage and farmed in Se-Bo Su-San near Dara National Park, Gyeongsangnam-do, South Korea (34°46′15.8″N, 128°23′54.0″E) (Fig. 1c) were collected on July, 2019.Samples were stored at −80 °C until genomic DNA was extracted using Circulomics Nanobind Tissue Big DNA Kit from brain and muscle tissues for PacBio HiFi and Arima Hi-C sequencing, respectively.We anaesthetized the animal with ethanol and sacrificed with guillotine to minimize pain, followed by tissue dissections; all protocols followed the guideline for animal care of Pukyong National University.Quantity and quality of DNA was determined by Qubit 3 Fluorometer and Agilent Fragment Analyzer.Two PacBio HiFi libraries with insert size of 16,000 bp were generated with 7.5 μg of genomic DNA using SMRTbell ® express template prep kit 2.0.The library was sequenced on a PacBio Sequel II system and 44 Gb of HiFi (QV ≥ 20) data was generated with 49 × coverage and an average read length of 14,000 bp 24 .Additionally, 80.68 Gb of Hi-C data with 89.64 × coverage from the same sample was generated with Arima Hi-C v2.1 24 (Table 4).
Geographical distribution map.Integrated information of every recorded occurrence of chub mackerel was retrieved from Ocean Biodiversity Information System (OBIS) database 25 .Citations for subsets of every dataset are provided in Supplementary table S1.The geographic distribution map (Fig. 1b,c) was visualized using rnaturalearth package 26 for R 27 by plotting coordinate information of OBIS data for mackerel occurrences on the world map.
Genome assembly.The fScoJap1 genome was assembled through VGP standard pipeline v2.0 (https:// training.galaxyproject.org/training-material/topics/assembly/tutorials/vgp_genome_assembly/tutorial.html) 13,14Fig. 2a).Bionano optical mapping was excluded because it did not produce sufficient quality long-molecule maps, which occurs for some species.The genome size was estimated to be 810,576,028 bp and 688,600,335 bp by GenomeScope 15 with k = 21 using Illumina and HiFi unassembled reads 24 , respectively (Fig. 2b,c).The tendency for genome size to be substantially underestimated when predicted by HiFi reads is prevalent in other species of various lineages 16,17 , with the biggest differences seen in fish [18][19][20][21][22] (Supplementary table S2).Such discrepancies are likely due to genomic regions that HiFi provides less coverage compared to Illumina 28 .Nonetheless, those regions are constructed with high accuracy in the final genome assembly, and thus the final genome size (Table 1) is larger than that predicted using HiFi reads (Fig. 2c) and closer to that predicted using Illumina reads (Fig. 2b).
Third, the remaining primary contigs were scaffolded (p1 → s) using Hi-C data with salsa v2.3 35,36 (Fig. 2d,e).Only the primary assembly (p1) was scaffolded, as the alternate (p2) contains just the alternate haplotype pieces of contigs that are not as complete as the primary.QUAST analysis after Hi-C scaffolding indicated that s comprised a total of 762 contigs (N50 = 22,224,178 bp).QV and completeness evaluated using Merqury were 23.2014 and 99.8512%, respectively for s (Supplementary table S3).
Last, the draft assembly was decontaminated and manually curated using gEVAL v2.2.0 37 (Fig. 2f).After 69 breaks, 463 joins and removal of 7 erroneously remaining duplicated contigs, the scaffold N50 was increased by 56% to 34.6 Mb and the scaffold count reduced by 53% to 360.Of the manually curated assembly, 98.9% could be assigned to 24 identified chromosomes, which were named according to synteny with the closely related Thunnus maccoyii (Southern bluefin tuna) assembly GCF_910596095.1.After manual curation, the curated assembly was 828,697,720 bp, containing 361 scaffolds with a scaffold N50 of 34,636,535 bp (Supplementary table S3).The manually curated assembly was uploaded on GenBank under accession GCA_027409825.1 38 , where the NCBI team removed some microbial contaminating contigs.The further decontaminated assembly was 828,681,152 bp, containing 1,932 contigs with contig N50 of 4,898,551 bp and 360 scaffolds with scaffold N50 of 34,636,535 bp (Table 1, Supplementary table S3).NCBI annotated this assembly under accession GCF_027409825.1 39 .All downstream analyses were carried out on the final assembly.database (http://telomerase.asu.edu/sequences_telomere.html).Soft-masked repeats and telomeric sequences located on telomeric regions (30 kb ends of chromosomes) of every chromosome were counted by an in-house Python script (https://github.com/chulbioinfo/fScoJap1) 40.
To evaluate if chromosomes were properly assembled and partitioned, we investigated telomeric repeats at the ends of each chromosome.437,667 occurrences of telomeric repeat sequence for the Scombriformes clade ' AACCCT' or its complementary ' AGGGTT' were identified throughout the genome with tidk.With an exception of the 3′ telomere of chromosome 10, all chromosomal telomeres of fScoJap1 assembly contained the telomeric repeat sequences (Fig. 3a, Table 2), suggesting that chromosomes were properly assembled end to end.For example, chromosome 1 had 907 and 772 copies of (ACCCTT)n telomeric repeats at the 5′ and 3′ ends, respectively (Fig. 3b-d).

Repeat annotation.
All repetitive regions of the fScoJap1 genome were located, soft-masked and incorporated in the assembly with WindowMasker 41 .Specific repetitive elements and their numbers were identified with RepeatMasker v4.1.5 42using Dfam v3.7 43 library for zebrafish (Danio rerio).Overall, 261,419,747 bp of sequences composing 31.55% of the assembly were masked as repeats by WindowMasker (Fig. 3a).Repetitive elements classified as specific repeat classes and families identified by RepeatMasker totaled 111,477,307 bp (Table 5), including 144,914 DNA transposons, totalling 18,619,431 bp.There was an overall tendency for repetitive elements to be concentrated at the telomeric regions of chromosomes (Fig. 3a).

Gene annotation. The assembled fScopJap1 genome was annotated through NCBI Eukaryotic Genome
Annotation Pipeline v10.1 44 .For gene prediction, experimental evidences retrieved from Entrez Nucleotide, Entrez Protein and SRA of NCBI were aligned to the fScoJap1 genome.52 GenBank transcripts and 304 EST sequence data from dbEST of chub mackerel were aligned using Splign 45 .RNA-Seq reads from 11 chub mackerel liver samples (NCBI Accession: SAMN08995495, SAMN08995496, SAMN08995497, SAMN08995498, SAMN08995499, SAMN08995500, SAMN08995501, SAMN08995502, SAMN08995503, SAMN08995504, SAMN10118436), one Atlantic chub mackerel liver sample (NCBI Accession: SAMN08159728), one Atlantic mackerel (Scomber scombrus) liver sample (NCBI Accession: SAMN12342693) and one Atlantic mackerel white muscle sample (NCBI Accession: SAMN04992872) were aligned using STAR 46 .RefSeq proteins of siamese fighting fish (Betta splendens), ray-finned fish (Actinopterygii), zebrafish, northern pike (Esox lucius), southern platyfish (Xiphophorus maculatus) and human (Homo sapiens) and GenBank proteins of ray-finned fish and human were aligned using ProSplign 47 .The annotation was uploaded on NCBI RefSeq with annotation ID "GCF_027409825.1-RS_2023_01." Duplication.Duplicated genes were identified using a wrapper for MCScanX 48 provided in TBtools-II v1.113 49 by searching for BLASTP matches within the fScoJap1 genome with the number of BLASTP hits for a gene restricted to five and an E-value cutoff set to 10 −10 .Only coding sequences (CDSs) with start and stop codons which totalled to 23,774 were analyzed and further classified according to a classification procedure by Wang et al. 48: WGD/segmental if it is an anchor gene in a collinear duplication; tandem duplicates if the corresponding duplicate is the gene adjacent on the chromosome; proximal if the duplicate is less than 20 genes apart; and dispersed for every other duplicated genes (Table 6).
A total of 19,994 genes contain various duplications classified into 13,158 dispersed, 1,092 proximal, 2,873 tandem and 2,871 WGD/segmental duplications, respectively (Table 6).Visual inspection of the circus plot suggested an overall tendency for genic duplications to be less in regions of the genome where transposons were located (Fig. 3a).To quantify this, we calculated the total length of transposons in duplicated genic regions of the genome compared to other regions.Whole genic regions had lower proportion overlapped with transposon elements (2.03%) than did whole intergenic regions (2.56%).Within the genic regions, the percentage of duplicated genic regions covered by transposon elements (1.30%) were almost twice as less than the percentage of singleton genic regions covered by transposons (2.37%; Table 7), suggesting a disposition of transposons to overlap less with duplicated genes.This finding is intriguing, as it is counterintuitive to the fact that transposons are in part responsible for forming new gene duplications 50 .
GC content and DNA methylation.Methylation profiles were identified by kinetic signatures imprinted on HiFi reads which specify positions of CpG sites and probabilities of 5mC modifications.The 5mC modification information of HiFi reads were read by primrose v1.3.0 51 which generated an identical set of HiFi reads with the information tagged as BAM tags.The tagged reads were aligned to the chub mackerel assembly, sorted and indexed by pbmm2 v1.10.0 (https://github.com/PacificBiosciences/pbmm2).Complete list of CpG sites and their 5mC modification probabilities based on the aligned tagged reads were generated by pb-CpG-tools v1.1.0(https://github.com/PacificBiosciences/pb-CpG-tools/),which calculated discretized modification probabilities based on the estimated ratio of reads mapped to the corresponding CpG site tagged as modified to those tagged as not modified.CpG islands were identified by 'newcpgreport' function of EMBOSS: 6.5.7.0 (http://emboss.bioinformatics.nl/cgi-bin/emboss/newcpgreport).
Genes are known to have differential methylation of CpG islands on promoters which affect transcription initiation in many genes 52 .All CpG sites were located and further classified as hyper-(>75%), hetero-(25%~75%) or hypo-methylated (<25%) discretized from 5-methylcytosine (5mC) modification probability.In total, 10,636,128 CpG sites were identified, of which 7,271,538 were likely, 2,108,856 were moderately likely, and 1,255,734 were unlikely methylated (Fig. 3a).A total of 35,728 CpG islands were found throughout the genome which summed to 10,839,030 bp in length (Fig. 3a).
A substantial number of CpG sites were found located on genes or supposable promoter regions of genes (≤1,000 bp upstream of transcription initiation site; Fig. 3a).For example, we found 118 CpG islands each covering one of 158 tRNA genes clustered in an approximately 80,000 bp long region between loci 5,019,165 and 5,098,985 bp on chromosome 3 (3:5,019,165-5,098,985) of the fScoJap1 genome (Fig. 4a).Such case is accordant with an observed tendency for human tRNA genes to have relatively short CpG islands located on them that cover all of the transcription units 53 .Whereas the CpG islands on the tRNA cluster 3:5,019,165-5,098,985 were heavily methylated, apparent by overall skew of CpG sites in the region towards being likely methylated (Fig. 4a), the CpG islands on promoter regions of several nearby genes of the chromosome were relatively unmethylated (Fig. 4b,d).For some genes, although the promoter region lacked a CpG island, the CpG sites at those regions were unmethylated (Fig. 4c,e).Such cases imply non-repression of expressions of those genes 54 .
The DNA hypo-methylation on promoters imply possibilities for new biological insights.For example, the Fads2 gene (located on 5:11,002,529-11,008,894 in fScoJap1 genome) is expected to be highly expressed in the chub mackerel because it is known to be associated with synthesis of docosahexaenoic acid (DHA), a type of omega-3, a polyunsaturated fatty acid 55 and a highly-valued nutritional component of chub mackerel.Fads2 genes code for desaturase enzymes to synthesize long-chain polyunsaturated fatty acids including DHA by introducing double bonds to endogenous fatty acids, causing them to become polyunsaturated 56 .Accordingly, we found the promoter region of Fads2 gene to be relatively non-methylated (Fig. 5).that 3,598 of 3,640 conserved single-copy genes in vertebrata were present in the final assembly, of which 3,537 were single-copies, 34 were duplicated, and 27 were fragmented (Supplementary table S3).Genes of fScoJap1 assembly were predicted via model-based and ab initio procedures with Gnomon 57 using an HMM-based algorithm to build annotation "GCF_027409825.1-RS_2023_01." The final gene set contained 31,656 genes with a mean length of 13,356 bp.Mean lengths of coding sequences (CDSs), exons and introns were 1,911, 228 and 1,682, respectively.There was a total of 258,465 exons in the genome and the mean number of exons per gene was 13.2715 (Table 3).BUSCO was run on "protein" mode using actinopterygii_odb10 lineage dataset (https://busco.ezlab.org/list_of_lineages.html) to assess the completeness of the prediction of gene annotation "GCF_027409825.1-RS_2023_01." Results of BUSCO analysis yielded a value of 99.1% (complete = 98.4%, single-copy = 97.3%,duplicated = 1.1%, fragmented = 0.7%, missing = 0.9%, genes = 3,640) (Table 8).

Fig. 1
Fig. 1 Morphological features, worldwide occurrences and sampling location of chub mackerel.(a) Morphology of chub mackerel provided from the Marine Fish Resource Bank of South Korea (MFRBK).(b) Locations of worldwide occurrences of chub mackerel.(c) Local map of the sampling location of the chub mackerel individual of fScoJap1 assembly marked as a blue star mark in South Korea (34°46′15.8″N, 128°23′54.0″E).Each red dot on the map represents an occurrence location.Some dots were shaded (30% transparency) to display overlapping dots.

Fig. 2
Fig. 2 Genome assembly process to build a reference genome of chub mackerel (fScoJap1).(a) VGP standard assembly pipeline v.2.0.with PacBio HiFi and Arima Hi-C data.Transformed linear GenomeScope profile plots of fScoJap1 genome generated with Illumina short reads (b) and PacBio HiFi reads (c).Pretext contact Hi-C maps of the duplication-removed contigs of fScoJap1 named as 'p' (d), the scaffolds of fScoJap1 linked by Hi-C named as 's' (e) and the final curated assembly of fScoJap1, ordered by chromosome numbers, named as 'pri.cur'(f).

Fig. 3
Fig. 3 Chromosome-level scaffolds in fScoJap1 genome assembly.(a) Circos plot of 24 chromosomes.From the outermost track, each track represents: chromosome lengths, all repeats, telomeric repeats, gaps, GC content, likely methylated CpG sites, moderately likely methylated CpG sites, unlikely methylated CpG sites, CpG islands, genes, DNA transposon elements and synteny links.The coordinate of the circos plot is indicated by the ticks on the chromosomal (the outermost) track.Each minor tick on the outer side of the chromosome represents 2 Mbp and each major tick represents 10 Mbp.All tracks are quantile scaled.For each track, the intensity of color represents the percentage of the bases occupied by the feature in every 100,000 bp window of the corresponding region of the genome.(b) repeats and telomeric repeats in chromosome 1.c. telomeric sequences at 5′ (c) and 3′ (d) ends in chromosome 1.

Fig. 5
Fig. 5 DNA hypo-methylations on the promoter of Fads2 gene.Local view of 12 kb region on chromosome 5:11,002,496-11,015,040 of fScoJap1 genome containing the promoter region of Fads2 gene.In descending order, each track pertains to genes, CpG islands, and CpG sites with three different classes of 5mC modification probabilities (>75%, 25-75% and <25%, respectively).

Table 1 .
Summary statistics of fScoJap1 assembly.More details in supplementary tableS3.
Telomeric repeats.Number of telomeric repeats for every 10,000 bp windows of the genome were identified with tidk v0.2.1 (https://github.com/tolkit/telomeric-identifier)by searching for forward and reverse matches with the telomeric repeat sequence for the Scombriformes clade (' AACCCT') obtained from the telomeric repeat

Table 3 .
Gene annotation of fScoJap1 assembly.

Table 4 .
Raw sequencing data of fScoJap1.

Table 7 .
Regions overlapped by transposon elements for duplicated genes with respect to other genes.